Study on Phrases Used for Semi-automatic Text-Based Speakers Names Extraction in the Czech Radio Broadcasts News

Authors

  • Michaela Kucharová
  • Svatava Skodová
  • Ladislav Seps
  • Marek Bohac
Abstract

In this paper we introduce a methodology for extending the speaker database used in the automatic transcription of spoken documents stored in the largest Czech Radio audio archive. We address one aspect of converting speech to written text: the automatic detection of speakers and their names. We work with a subset of the archive consisting of 8,020 hours of broadcast news and 58,914,179 words from the years 1968–2011. Thousands of speakers' names occur over this period, so their identification needs to be automatic or semi-automatic. A further issue we investigate, which also leads to the extension of the speaker database, is the co-occurrence of a speaker's name within a specific phrase in the text transcription and a speaker change in the audio recording.
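The idea of linking a name-bearing phrase to a speaker change can be illustrated with a minimal Python sketch. It is not the authors' method: the trigger words are hypothetical English stand-ins for the Czech phrases studied in the paper, the transcript is assumed to be already split into one string per speaker turn, and the function name `candidate_names` and the `window` parameter are invented for this example. The sketch also assumes that the phrase appears shortly before the change and is followed by a two-word name.

```python
from collections import Counter

# Hypothetical trigger words (assumption): English stand-ins for the
# Czech introductory phrases, e.g. "... reports Jan Novák".
TRIGGERS_BEFORE_NAME = ("reports", "says")


def is_name_token(token):
    """Rough check: alphabetic token starting with an uppercase letter."""
    return token[:1].isupper() and token.isalpha()


def candidate_names(segments, window=8):
    """Count two-word names that follow a trigger word within the last
    `window` tokens of a segment, i.e. just before a speaker change.

    `segments` is a list of transcript strings, one per speaker turn
    (assumed input format). The last segment is skipped because no
    speaker change follows it.
    """
    counts = Counter()
    for segment in segments[:-1]:
        # Strip trailing punctuation and keep only the end of the turn.
        tail = [t.strip(".,:;") for t in segment.split()][-window:]
        for i, token in enumerate(tail[:-2]):
            if token.lower() in TRIGGERS_BEFORE_NAME:
                first, last = tail[i + 1], tail[i + 2]
                if is_name_token(first) and is_name_token(last):
                    counts[f"{first} {last}"] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "From Prague reports Jan Novák.",
        "The government met today.",
        "That is all for now.",
    ]
    print(candidate_names(demo))
```

In a semi-automatic setting, the resulting counts would serve only as candidates for a human annotator to confirm before any name is added to the speaker database.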

Similar resources

Exploitation of Named Entities in Automatic Text Summarization for Swedish

Background: The technique of automatic text summarization has been under development for many years (Luhn 1959, Edmundson 1969 and Salton 1989). One approach is text extraction: pieces of the original text are selected on a statistical basis or with heuristic methods and assembled into a new, shorter text that preserves as much information as possible (Mani & Maybury...

Improving Automatic Summarization of Persian Texts Using Natural Language Processing Methods and a Similarity Graph

A significant amount of available information is stored in textual databases, which contain large collections of documents from different sources (such as news, articles, books, emails and web pages). The increasing visibility and importance of this class of information motivates us to work on better automatic evaluation tools for textual resources. The automatic summarization of tex...

Radio: Content Filtering and Delivery for Broadcast Audio

Synthetic News Radio uses automatic speech recognition and clustered text news stories to automatically find story boundaries in an audio news broadcast, and it creates semantic representations that can match stories of similar content through audio-based queries. Current speech recognition technology cannot by itself produce enough information to accurately characterize news audio; therefore, ...

Information Extraction from Broadcast News

This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular we concentrate on statistical finite state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first represents name class information as a word attribute; the second represents both word-...

A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the most similar ones to the user ...

Publication date: 2014